Group Members: Matthew Dang, Vincent Cheng, Brian Chung
For this tutorial, we decided to analyze US pollution data from 2000 - 2016. The data was gathered by the US EPA from monitoring sites across America and focuses on four pollutants: Nitrogen Dioxide (NO2), Sulfur Dioxide (SO2), Carbon Monoxide (CO), and Ozone (O3). As climate change has become a recurring issue in the international community, we wanted to see how America's efforts have reduced pollution. We hope to find trends in the data that highlight progress in reducing pollution. Note that the data only contains pollution values recorded at EPA monitoring locations.

The first part of the process is data curation and parsing. We need to import our dataset, which can be found here: https://www.kaggle.com/sogun3/uspollution/data. Data is never clean and needs to be modified because of missing values or extraneous information. This could mean removing information that is not relevant for our analysis, dropping NaN values, or restructuring our dataframe to help our analysis later on. This gives us a clearer understanding of what the data shows and an easier time parsing, analyzing, and later visualizing it. We followed the principles of tidy data. If you want more information about tidy data, here is a link: https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf.
First we will import some libraries that we will need for our analysis.
!pip install ggplot
!pip install plotly
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
from ggplot import *
import plotly.plotly as py
import warnings
warnings.filterwarnings("ignore")
data_raw = pd.read_csv("data/pollution_us_2000_2016.csv")
data_raw.head()
The data is imported and stored using the pandas module. pandas is a common library for storing, modifying, and manipulating data; later, the data is plotted and analyzed for possible trends. For more information on pandas and how to use it, here is a link: http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/. Here is also a short tutorial on the basic functions of pandas: https://www.dataquest.io/blog/pandas-python-tutorial/.
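As a quick illustration of the kind of manipulation pandas makes easy, here is a toy example (not our pollution data) of the two operations we lean on below: dropping missing values and group-wise averaging.

```python
import pandas as pd

# Toy DataFrame illustrating the operations used in this tutorial
df = pd.DataFrame({
    'state': ['MD', 'MD', 'VA'],
    'no2': [12.0, None, 9.5],
})

# dropna() removes rows containing NaN values
clean = df.dropna()

# groupby()/mean() computes a per-group average
means = clean.groupby('state', as_index=False)['no2'].mean()
print(means)
```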
# remove any rows containing NaN values
data_raw = data_raw.dropna()
data_raw.head()
# view our variables
list(data_raw.columns)
There are some unnecessary variables here that we don't need, so we can remove them.
drop = ['Unnamed: 0',
'State Code',
'County Code',
'Site Num',
'Address',
'County',
'City',
'NO2 Units',
'NO2 1st Max Value',
'NO2 1st Max Hour',
'O3 Units',
'O3 1st Max Value',
'O3 1st Max Hour',
'SO2 Units',
'SO2 1st Max Value',
'SO2 1st Max Hour',
'CO Units',
'CO 1st Max Value',
'CO 1st Max Hour']
data = data_raw.drop(drop, axis = 1)
data.head()
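A note on `DataFrame.drop`: passing `errors='ignore'` makes the call robust if a column in the list is already missing (for example, if the cell is re-run after the drop has already happened). A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# errors='ignore' skips column names that are not present
out = df.drop(['b', 'not_a_column'], axis=1, errors='ignore')
print(list(out.columns))  # ['a', 'c']
```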
We can also change the states to their abbreviations. This will make our graphs more readable later on.
us_state_abbrev = {
'Alabama': 'AL',
'Alaska': 'AK',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'Florida': 'FL',
'Georgia': 'GA',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Pennsylvania': 'PA',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY',
}
data['state'] = data['State'].map(us_state_abbrev)
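Note that `Series.map` silently returns NaN for any name missing from the dictionary (the dataset may also contain entries such as 'District Of Columbia' that are not in the mapping above), so it is worth checking for unmapped values. A small sketch:

```python
import pandas as pd

# Toy mapping and data, not the full abbreviation dictionary
abbrev = {'Maryland': 'MD', 'Virginia': 'VA'}
s = pd.Series(['Maryland', 'Virginia', 'District Of Columbia'])

# Keys absent from the dict map to NaN instead of raising an error
mapped = s.map(abbrev)
print(mapped.isna().sum())  # 1 unmapped entry
```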
# Make column names without spaces so they can be used in regression formulas later
data['NO2Mean'] = data['NO2 Mean']
data['NO2AQI'] = data['NO2 AQI']
# O3 and CO means are reported in parts per million; multiply by 1000
# to put them on the same parts-per-billion scale as NO2 and SO2
data['O3Mean'] = data['O3 Mean'] * 1000
data['O3AQI'] = data['O3 AQI']
data['SO2Mean'] = data['SO2 Mean']
data['SO2AQI'] = data['SO2 AQI']
data['COMean'] = data['CO Mean'] * 1000
data['COAQI'] = data['CO AQI']
# data = data.drop('State', 1)
# data = data.drop('NO2 Mean', 1)
# data = data.drop('NO2 AQI', 1)
# data = data.drop('O3 Mean', 1)
# data = data.drop('O3 AQI', 1)
# data = data.drop('SO2 Mean', 1)
# data = data.drop('SO2 AQI', 1)
# data = data.drop('CO Mean', 1)
# data = data.drop('CO AQI', 1)
data
For our analysis we are interested in changes by year, so we will parse just the year from each date.
data['year'] = data['Date Local'].apply(lambda x: x[0:4])
data = data.drop('Date Local', axis=1)
data_avg = data.groupby(['state', 'year'],as_index=False).mean()
data_avg.head()
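The string slice works because 'Date Local' is formatted YYYY-MM-DD; an equivalent and somewhat more robust approach (sketched here on a toy column with the same format) is to let pandas parse the dates:

```python
import pandas as pd

# Toy column in the same YYYY-MM-DD format as 'Date Local'
dates = pd.Series(['2000-01-01', '2008-06-15', '2016-12-31'])

# Parse to datetime, then extract the year as an integer
years = pd.to_datetime(dates).dt.year
print(list(years))  # [2000, 2008, 2016]
```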
We will use pyplot, a MATLAB-like plotting framework. Matplotlib has thorough documentation and lots of sample code if you want to get started; here is a link to a basic tutorial on using pyplot: https://matplotlib.org/2.0.2/users/pyplot_tutorial.html. There are multiple ways to visualize data. Visualization provides readers with an easily understandable view of a large dataset, so it is important that the data is represented in an accurate and presentable way.
We will first look at the change in the levels of the various pollutants over time.
plt.figure(figsize=(20, 15))
for state, grp in data_avg.groupby(['state']):
    plt.plot(grp['year'], grp['SO2 Mean'], label=state)
plt.xlabel("Year")
plt.ylabel("SO2 Mean")
plt.title("SO2 Mean vs Year")
plt.legend(loc='best')
plt.show()
plt.figure(figsize=(20, 15))
for state, grp in data_avg.groupby(['state']):
    plt.plot(grp['year'], grp['CO Mean'], label=state)
plt.xlabel("Year")
plt.ylabel("CO Mean")
plt.title("CO Mean vs Year")
plt.legend(loc='best')
plt.show()
plt.figure(figsize=(20, 15))
for state, grp in data_avg.groupby(['state']):
    plt.plot(grp['year'], grp['NO2 Mean'], label=state)
plt.xlabel("Year")
plt.ylabel("NO2 Mean")
plt.title("NO2 Mean vs Year")
plt.legend(loc='best')
plt.show()
plt.figure(figsize=(20, 15))
for state, grp in data_avg.groupby(['state']):
    plt.plot(grp['year'], grp['O3 Mean'], label=state)
plt.xlabel("Year")
plt.ylabel("O3 Mean")
plt.title("O3 Mean vs Year")
plt.legend(loc='best')
plt.show()
From these plots we can see that SO2 and CO mean concentrations generally decrease over time. However, NO2 and O3 mean concentrations don't follow a linear trend and fluctuate over time.
We will next look at the change in air quality index (AQI) over time for each pollutant.
plt.figure(figsize=(20, 15))
for state, grp in data_avg.groupby(['state']):
    plt.plot(grp['year'], grp['SO2 AQI'], label=state)
plt.xlabel("Year")
plt.ylabel("SO2 AQI")
plt.title("SO2 AQI vs Year")
plt.legend(loc='best')
plt.show()
plt.figure(figsize=(20, 15))
for state, grp in data_avg.groupby(['state']):
    plt.plot(grp['year'], grp['CO AQI'], label=state)
plt.xlabel("Year")
plt.ylabel("CO AQI")
plt.title("CO AQI vs Year")
plt.legend(loc='best')
plt.show()
plt.figure(figsize=(20, 15))
for state, grp in data_avg.groupby(['state']):
    plt.plot(grp['year'], grp['NO2 AQI'], label=state)
plt.xlabel("Year")
plt.ylabel("NO2 AQI")
plt.title("NO2 AQI vs Year")
plt.legend(loc='best')
plt.show()
plt.figure(figsize=(20, 15))
for state, grp in data_avg.groupby(['state']):
    plt.plot(grp['year'], grp['O3 AQI'], label=state)
plt.xlabel("Year")
plt.ylabel("O3 AQI")
plt.title("O3 AQI vs Year")
plt.legend(loc='best')
plt.show()
From these plots we can see that SO2 AQI and CO AQI have a negative linear relationship with year. O3 AQI and NO2 AQI also generally decrease over time, but fluctuate much more than SO2 AQI and CO AQI.
The previous plots are a bit complicated to read, but we can use other forms of data visualization to analyze the data more clearly. Another way to visualize this data is with a violin plot, which shows the distribution of the data and its probability density; it is a combination of a box plot and a density plot. We will plot pollutant concentration over time and pollutant air quality index over time using the ggplot package, a plotting system for Python based on R's ggplot2. This package is designed for making attractive plots quickly with minimal code. You can refer to the documentation and tutorials on how to use ggplot here: http://ggplot.yhathq.com/docs/index.html
f1 = ggplot(aes(x='year', y='SO2 Mean'), data=data_avg) +\
    geom_violin() +\
    labs(title="SO2 Mean over time", x="year", y="SO2 Mean")
f2 = ggplot(aes(x='year', y='CO Mean'), data=data_avg) +\
    geom_violin() +\
    labs(title="CO Mean over time", x="year", y="CO Mean")
f3 = ggplot(aes(x='year', y='NO2 Mean'), data=data_avg) +\
    geom_violin() +\
    labs(title="NO2 Mean over time", x="year", y="NO2 Mean")
f4 = ggplot(aes(x='year', y='O3 Mean'), data=data_avg) +\
    geom_violin() +\
    labs(title="O3 Mean over time", x="year", y="O3 Mean")
f5 = ggplot(aes(x='year', y='SO2 AQI'), data=data_avg) +\
    geom_violin() +\
    labs(title="SO2 AQI over time", x="year", y="SO2 AQI")
f6 = ggplot(aes(x='year', y='CO AQI'), data=data_avg) +\
    geom_violin() +\
    labs(title="CO AQI over time", x="year", y="CO AQI")
f7 = ggplot(aes(x='year', y='NO2 AQI'), data=data_avg) +\
    geom_violin() +\
    labs(title="NO2 AQI over time", x="year", y="NO2 AQI")
f8 = ggplot(aes(x='year', y='O3 AQI'), data=data_avg) +\
    geom_violin() +\
    labs(title="O3 AQI over time", x="year", y="O3 AQI")
f1.show()
f2.show()
f3.show()
f4.show()
f5.show()
f6.show()
f7.show()
f8.show()
From the CO and SO2 mean violin plots, we can see that in 2000 the distributions have a lot of variation, but over the years the variation decreases. However, the NO2 and O3 mean violin plots show high variation throughout 2000 - 2016 and stay fairly constant. The AQI violin plots show similar respective results to the pollutant mean violin plots.
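If the ggplot package gives you installation trouble (it is no longer actively maintained), matplotlib also has a built-in violin plot. Here is a minimal sketch on synthetic yearly samples, not the pollution data:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# One array of synthetic "SO2 Mean"-like values per year
samples = [rng.normal(loc=10 - i, scale=2.0, size=50) for i in range(3)]

fig, ax = plt.subplots()
# Draw one violin per year at the given x positions
parts = ax.violinplot(samples, positions=[2000, 2001, 2002], widths=0.8)
ax.set_xlabel("Year")
ax.set_ylabel("SO2 Mean")
ax.set_title("SO2 Mean over time (synthetic data)")
fig.savefig("violin_demo.png")
```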
Another way to visualize this data is with a choropleth map, a thematic map in which areas are shaded in proportion to the statistic being displayed. We believe this provides a unique way to compare levels between states and gives readers a better understanding of current trends in pollution. In this case, the map is of the US, and each state is shaded in proportion to its average concentration of each pollutant; the higher the concentration, the darker the shade. Below are choropleth maps for the years 2000, 2008, and 2016. We will be using the plotly package, which requires an API key. You can register for one here: https://plot.ly/. Once you have an API key, you can set your credentials with this line:
plotly.tools.set_credentials_file(username='YOUR_USERNAME', api_key='YOUR_API_KEY')
Plotly provides detailed tutorials/sample code and documentation. You can check them out here: https://help.plot.ly/tutorials/
import plotly
import plotly.plotly as py
from plotly.graph_objs import *
# plotly.tools.set_credentials_file(username='YOUR_USERNAME', api_key='YOUR_API_KEY')
# choropleth map of SO2 levels in the US in 2000
# use .copy() so the 'text' columns added below don't trigger SettingWithCopyWarning
data_2000 = data_avg[data_avg['year'] == '2000'].copy()
data_2008 = data_avg[data_avg['year'] == '2008'].copy()
data_2016 = data_avg[data_avg['year'] == '2016'].copy()
data_2000['text'] = data_2000['state'] + " has a SO2 Mean level of " + data_2000['SO2 Mean'].astype(str)
data_2008['text'] = data_2008['state'] + " has a SO2 Mean level of " + data_2008['SO2 Mean'].astype(str)
data_2016['text'] = data_2016['state'] + " has a SO2 Mean level of " + data_2016['SO2 Mean'].astype(str)
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
map_data = [ dict(
type='choropleth',
colorscale = scl,
autocolorscale = False,
locations = data_2000['state'],
z = data_2000['SO2 Mean'].astype(float),
locationmode = 'USA-states',
text = data_2000['text'],
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
)
),
colorbar = dict(
title = "Parts Per Billion"
)
) ]
layout = dict(
title = '2000 US Sulfur Dioxide Pollution Levels by State<br>(Hover for breakdown)',
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)',
),
)
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
# choropleth map of SO2 levels in the US in 2008
map_data[0]["z"] = data_2008['SO2 Mean'].astype(float)
map_data[0]["locations"] = data_2008['state']
map_data[0]["text"] = data_2008['text']
layout["title"] = '2008 US Sulfur Dioxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
# choropleth map of SO2 levels in the US in 2016
map_data[0]["z"] = data_2016['SO2 Mean'].astype(float)
map_data[0]["locations"] = data_2016['state']
map_data[0]["text"] = data_2016['text']
layout["title"] = '2016 US Sulfur Dioxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
In 2000, Virginia has the highest Sulfur Dioxide Concentration. In 2008, New York has the highest level and in 2016 Ohio has the highest level.
data_2000['text'] = data_2000['state'] + " has a CO Mean level of " + data_2000['CO Mean'].astype(str)
data_2008['text'] = data_2008['state'] + " has a CO Mean level of " + data_2008['CO Mean'].astype(str)
data_2016['text'] = data_2016['state'] + " has a CO Mean level of " + data_2016['CO Mean'].astype(str)
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
map_data = [ dict(
type='choropleth',
colorscale = scl,
autocolorscale = False,
locations = data_2000['state'],
z = data_2000['CO Mean'].astype(float),
locationmode = 'USA-states',
text = data_2000['text'],
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
)
),
colorbar = dict(
title = "Parts Per Billion"
)
) ]
layout = dict(
title = '2000 US Carbon Monoxide Pollution Levels by State<br>(Hover for breakdown)',
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)',
),
)
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
map_data[0]["z"] = data_2008['CO Mean'].astype(float)
map_data[0]["locations"] = data_2008['state']
map_data[0]["text"] = data_2008['text']
layout["title"] = '2008 US Carbon Monoxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
map_data[0]["z"] = data_2016['CO Mean'].astype(float)
map_data[0]["locations"] = data_2016['state']
map_data[0]["text"] = data_2016['text']
layout["title"] = '2016 US Carbon Monoxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
In 2000, Indiana has the highest Carbon Monoxide concentration. In 2008, Arkansas has the highest level and in 2016 Florida has the highest level.
data_2000['text'] = data_2000['state'] + " has a O3 Mean level of " + data_2000['O3 Mean'].astype(str)
data_2008['text'] = data_2008['state'] + " has a O3 Mean level of " + data_2008['O3 Mean'].astype(str)
data_2016['text'] = data_2016['state'] + " has a O3 Mean level of " + data_2016['O3 Mean'].astype(str)
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
map_data = [ dict(
type='choropleth',
colorscale = scl,
autocolorscale = False,
locations = data_2000['state'],
z = data_2000['O3 Mean'].astype(float),
locationmode = 'USA-states',
text = data_2000['text'],
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
)
),
colorbar = dict(
title = "Parts Per Billion"
)
) ]
layout = dict(
title = '2000 US Ozone Pollution Levels by State<br>(Hover for breakdown)',
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)',
),
)
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
map_data[0]["z"] = data_2008['O3 Mean'].astype(float)
map_data[0]["locations"] = data_2008['state']
map_data[0]["text"] = data_2008['text']
layout["title"] = '2008 US Ozone Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
map_data[0]["z"] = data_2016['O3 Mean'].astype(float)
map_data[0]["locations"] = data_2016['state']
map_data[0]["text"] = data_2016['text']
layout["title"] = '2016 US Ozone Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
In 2000, North Carolina has the highest Ozone Concentration. In 2008 and 2016, Wyoming has the highest level.
data_2000['text'] = data_2000['state'] + " has a NO2 Mean level of " + data_2000['NO2 Mean'].astype(str)
data_2008['text'] = data_2008['state'] + " has a NO2 Mean level of " + data_2008['NO2 Mean'].astype(str)
data_2016['text'] = data_2016['state'] + " has a NO2 Mean level of " + data_2016['NO2 Mean'].astype(str)
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
map_data = [ dict(
type='choropleth',
colorscale = scl,
autocolorscale = False,
locations = data_2000['state'],
z = data_2000['NO2 Mean'].astype(float),
locationmode = 'USA-states',
text = data_2000['text'],
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
)
),
colorbar = dict(
title = "Parts Per Billion"
)
) ]
layout = dict(
title = '2000 US Nitrogen Dioxide Pollution Levels by State<br>(Hover for breakdown)',
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)',
),
)
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
map_data[0]["z"] = data_2008['NO2 Mean'].astype(float)
map_data[0]["locations"] = data_2008['state']
map_data[0]["text"] = data_2008['text']
layout["title"] = '2008 US Nitrogen Dioxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
map_data[0]["z"] = data_2016['NO2 Mean'].astype(float)
map_data[0]["locations"] = data_2016['state']
map_data[0]["text"] = data_2016['text']
layout["title"] = '2016 US Nitrogen Dioxide Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig)
In 2000, Arizona has the highest Nitrogen Dioxide concentration. In 2008, Massachusetts has the highest level and in 2016 Utah has the highest level.
Finally, we will use regression to quantify the relationship between each pollutant's mean concentration and its AQI, fitting an ordinary least squares model and a gradient descent (SGD) model for each pollutant with scikit-learn and statsmodels.
from sklearn import linear_model
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor
import statsmodels.formula.api as smf
import statsmodels.api as sm
from IPython.core import display as ICD
from IPython.display import display, HTML
#NO2 Lin reg model
# Get a lin reg model
regr = linear_model.LinearRegression()
# Fit the model to the data
regr.fit(data['NO2Mean'].values.reshape(-1, 1), data['NO2AQI'])
# Get predicted NO2 AQI values based on the model
Predicted_NO2_AQI = regr.predict(data['NO2Mean'].values.reshape(-1, 1))
# Gradient Descent Regression
SGD = SGDRegressor(penalty='elasticnet')
SGD = SGD.fit(data['NO2Mean'].values.reshape(-1, 1), data['NO2AQI'])
SGDPrediction=SGD.predict(data['NO2Mean'].values.reshape(-1, 1))
# Plot data and the lin reg
plt.scatter(data['NO2Mean'], data['NO2AQI'], color='lightgreen')
plt.plot(data['NO2Mean'], Predicted_NO2_AQI, color='darkblue', linewidth=1, label="Linear Regression")
plt.plot(data['NO2Mean'], SGDPrediction, color='darkred', linewidth=1, label='Gradient Descent')
plt.legend()
plt.show()
regression = smf.ols(formula='NO2AQI ~ NO2Mean', data=data).fit()
print(regression.summary())
print("Estimated R^2 Value For Linear Regression="+ str(regr.score(data['NO2Mean'].values.reshape(-1, 1), data['NO2AQI'])))
print("Estimated R^2 Value For Gradient Descent="+ str(SGD.score(data['NO2Mean'].values.reshape(-1, 1), data['NO2AQI'])))
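One caveat on the gradient-descent fits: SGDRegressor is sensitive to feature scale, which is why StandardScaler is imported above. Here is a sketch, on synthetic data, of standardizing the feature before fitting:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic linear relationship, roughly AQI ~ 2 * concentration
X = rng.uniform(0, 50, size=(500, 1))
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=500)

# Standardize the feature so gradient descent converges reliably
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

sgd = SGDRegressor(penalty='elasticnet', random_state=0)
sgd.fit(X_scaled, y)
print(sgd.score(X_scaled, y))  # R^2 close to 1 on this clean synthetic data
```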
#O3 Lin reg model
from sklearn import linear_model
# Get a lin reg model
regr = linear_model.LinearRegression()
# Fit the model to the data
regr.fit(data['O3Mean'].values.reshape(-1, 1), data['O3AQI'])
# Get predicted O3 AQI values based on the model
Predicted_O3_AQI = regr.predict(data['O3Mean'].values.reshape(-1, 1))
# Gradient Descent Regression
SGD = SGDRegressor(penalty='elasticnet')
SGD = SGD.fit(data['O3Mean'].values.reshape(-1, 1), data['O3AQI'])
SGDPrediction=SGD.predict(data['O3Mean'].values.reshape(-1, 1))
# Plot data and the lin reg
plt.scatter(data['O3Mean'], data['O3AQI'], color='lightgreen')
plt.plot(data['O3Mean'], Predicted_O3_AQI, color='darkblue', linewidth=1,label='Linear Regression')
plt.plot(data['O3Mean'], SGDPrediction, color='darkred', linewidth=1, label='Gradient Descent')
plt.legend()
plt.show()
regression = smf.ols(formula='O3AQI ~ O3Mean', data=data).fit()
print(regression.summary())
print("Estimated R^2 Value For Linear Regression="+ str(regr.score(data['O3Mean'].values.reshape(-1, 1), data['O3AQI'])))
print("Estimated R^2 Value For Gradient Descent="+ str(SGD.score(data['O3Mean'].values.reshape(-1, 1), data['O3AQI'])))
#SO2 Lin reg model
from sklearn import linear_model
# Get a lin reg model
regr = linear_model.LinearRegression()
# Fit the model to the data
regr.fit(data['SO2Mean'].values.reshape(-1, 1), data['SO2AQI'])
# Get predicted SO2 AQI values based on the model
Predicted_SO2_AQI = regr.predict(data['SO2Mean'].values.reshape(-1, 1))
# Gradient Descent Regression
SGD = SGDRegressor(penalty='elasticnet')
SGD = SGD.fit(data['SO2Mean'].values.reshape(-1, 1), data['SO2AQI'])
SGDPrediction=SGD.predict(data['SO2Mean'].values.reshape(-1, 1))
# Plot data and the lin reg
plt.scatter(data['SO2Mean'], data['SO2AQI'], color='lightgreen')
plt.plot(data['SO2Mean'], Predicted_SO2_AQI, color='darkblue', linewidth=1,label='Linear Regression')
plt.plot(data['SO2Mean'], SGDPrediction, color='darkred', linewidth=1, label='Gradient Descent')
plt.legend()
plt.show()
# print coef, intercept, R^2
regression = smf.ols(formula='SO2AQI ~ SO2Mean', data=data).fit()
print(regression.summary())
print("Estimated R^2 Value For Linear Regression="+ str(regr.score(data['SO2Mean'].values.reshape(-1, 1), data['SO2AQI'])))
print("Estimated R^2 Value For Gradient Descent="+ str(SGD.score(data['SO2Mean'].values.reshape(-1, 1), data['SO2AQI'])))
#CO Lin reg model
from sklearn import linear_model
# Get a lin reg model
regr = linear_model.LinearRegression()
# Fit the model to the data
regr.fit(data['CO Mean'].values.reshape(-1, 1), data['CO AQI'])
# Get predicted CO AQI values based on the model
Predicted_CO_AQI = regr.predict(data['CO Mean'].values.reshape(-1, 1))
## Gradient Descent Predictor
SGD = SGDRegressor()
SGD = SGD.fit(data['CO Mean'].values.reshape(-1, 1), data['CO AQI'])
SGDPrediction=SGD.predict(data['CO Mean'].values.reshape(-1, 1))
# Plot data and the lin reg
plt.scatter(data['CO Mean'], data['CO AQI'], color='lightgreen')
plt.plot(data['CO Mean'], Predicted_CO_AQI, color='darkblue', linewidth=1, label='Linear Regression')
plt.plot(data['CO Mean'], SGDPrediction, color='darkred', linewidth=1, label='Gradient Descent')
plt.legend()
plt.show()
# print coef, intercept, R^2
regression = smf.ols(formula='COAQI ~ COMean', data=data).fit()
print(regression.summary())
print("Estimated R^2 Value For Linear Regression="+ str(regr.score(data['CO Mean'].values.reshape(-1, 1), data['CO AQI'])))
print("Estimated R^2 Value For Gradient Descent="+ str(SGD.score(data['CO Mean'].values.reshape(-1, 1), data['CO AQI'])))
In conclusion, America has reduced its pollution output over the last decade. There is also less variation in pollutant levels over time, which suggests that the US has been regulating pollution consistently across the country. We were able to reach this conclusion from this dataset, but the data was inconsistent and had many missing values; not all states have data for every year from 2000 - 2016, which makes an accurate analysis harder. From the available information, we see a decrease in NO2, CO, and SO2 from 2000 to 2016; however, O3 has fluctuated and has not really decreased over this period. From the regressions, there is clearly some correlation between AQI and concentration, and the correlation is higher for some of the pollutants, such as SO2 and O3. We hope this tutorial was thorough and gave you some understanding and experience working with Python, data-science packages like pandas, numpy, and matplotlib, and the data science process.